Distributed High-performance Web Crawlers: A Survey of the State of the Art
Abstract
Web crawlers (also called web spiders or robots) are programs used to download documents from the Internet. Simple crawlers can be used by individuals to copy an entire web site to their hard drive for local viewing. For such small-scale tasks, numerous utilities like wget exist. In fact, an entire web crawler can be written in about 20 lines of Python code; the task is inherently simple, and the general algorithm is shown in Figure 1. However, if one needs to crawl a large portion of the web (e.g., Google currently indexes over 3 billion web pages), the task becomes astoundingly difficult.
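A minimal sketch of the general crawl loop the abstract alludes to: keep a frontier of pending URLs, fetch a page, extract its links, and enqueue links that have not been seen yet. The seed URL, page budget, and helper class below are illustrative choices rather than anything taken from Figure 1, and the sketch omits the politeness machinery (robots.txt handling, per-host delays) that a real crawler needs.

# Minimal single-machine crawl loop (illustrative sketch, not the paper's
# Figure 1). It keeps a frontier of pending URLs, fetches each page,
# extracts links, and queues unseen ones until a page budget is exhausted.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags while a page is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=100):
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # never enqueue the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue           # skip unreachable or undecodable pages
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        yield url, html


if __name__ == "__main__":
    # example.com is a placeholder seed; replace with a real start page.
    for page_url, _ in crawl("https://example.com", max_pages=20):
        print(page_url)

At scale, every step of this loop (the frontier, duplicate detection, DNS resolution, fetching, and storage) becomes a bottleneck of its own, which is the gap between such a 20-line toy and the distributed systems this survey covers.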
Similar resources
Improving the performance of focused web crawlers
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...
Load Balancing Approaches for Web Servers: A Survey of Recent Trends
Numerous works have been done on load balancing of web servers in grid environments. The reason behind the popularity of the grid environment is that it allows access to distributed resources located at remote sites. For effective utilization, load must be balanced among all resources. The importance of load balancing is discussed by distinguishing between systems without load balancing and with load balancing...
Emergent System for Information Retrieval
Stand-alone as well as distributed web crawlers employ high-performance, sophisticated algorithms which, on the other hand, require a high degree of computational power. They also use complex interprocess communication techniques (multithreading, shared memory, etc.). As opposed to distributed web crawlers, the ERRIE crawler system presented in this paper displays emergent behavior by employ...
Design and Implementation of a High-Performance Distributed Web Crawler
Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits m...
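One central decision in a distributed crawler of the kind described above is how to split the URL space among crawler nodes. A common scheme, assumed here for illustration rather than taken from that paper, assigns each URL to a node by hashing its hostname, so that all pages of a site stay on one node and per-host politeness can be enforced locally. A minimal sketch, with the node count and hash function as arbitrary choices:

# Sketch of host-hash partitioning of the URL space across crawler nodes.
# Hashing the hostname (rather than the full URL) keeps every page of a
# site on the same node. The node count and the use of MD5 are illustrative.
import hashlib
from urllib.parse import urlsplit


def assign_node(url: str, num_nodes: int) -> int:
    """Return the index of the crawler node responsible for this URL."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes


if __name__ == "__main__":
    for u in ["http://example.com/a.html",
              "http://example.com/b.html",
              "http://example.org/index.html"]:
        print(u, "-> node", assign_node(u, num_nodes=4))

In such a scheme, links discovered on one node that hash to another are forwarded in small URL batches, so inter-node traffic carries URLs rather than page contents.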
HF-Blocker: Detection of Distributed Denial of Service Attacks Based On Botnets
Today, botnets have become a serious threat to enterprise networks. By creating networks of bots, they launch several kinds of attacks; distributed denial-of-service (DDoS) attacks on networks are one example. Such attacks, by occupying system resources, have proven to be an effective method of denying network services. Botnets that launch HTTP packet flood attacks agains...
Publication year: 2003